(54) Method for address lookup

(57) An efficient method of storing prefixes related to addresses in a binary trie
fashion wherein each node in the tree has a prefix stored in it and no node is empty.
Methods for searching, inserting and deleting. A network system with prefixes of network
addresses stored in a binary trie fashion wherein each node in the tree has a prefix
stored in it and no node is empty. Fast longest matching prefix lookup, efficient memory
usage (one node per prefix), and fully dynamic operation are supported. A greedy
algorithm that calculates the binary trie of the present invention with minimum overall
depth. A dynamic programming approach that constructs the binary trie of the present
invention with the minimal expected number of search steps, based on an arbitrary
distribution of destination IP addresses. A pipelined hardware structure for the binary
trie of the present invention, providing a throughput of one longest prefix match per
memory access time, with insert and delete operations requiring no more than two clock
cycle stalls in the pipeline.

Description

IA. Field of the Invention

[0001] This invention relates to storing prefixes related to addresses efficiently. Specifically, this invention relates to
storing prefixes related to addresses in a binary trie fashion wherein each node in the trie has a prefix stored and no
node is empty. The present invention is embodied in a method of storing prefixes related to addresses in a binary trie
fashion; in a method of storing prefixes related to network addresses in a binary trie fashion wherein each node in the
trie has a prefix stored and no node is empty; in a networking system where network addresses are stored in a binary
trie fashion wherein each node in the trie has a prefix stored and no node is empty; and in a computer program product
which enables a computer to store addresses in a binary trie fashion wherein each node in the trie has a prefix stored
and no node is empty.

IB. Background of the Invention 

[0002] Storing addresses and prefixes related to addresses efficiently is important for any system that uses multiple
addresses. It should be noted that this background section discusses forwarding tables associated with routers used in
many Internet applications. However, the techniques and principles discussed apply to any system where a table is
required to store prefixes related to multiple addresses. In networking systems, a large number of network addresses
need to be stored efficiently.

[0003] IP addresses typically have 32 bits. An IP datagram contains both a source and a destination IP address. At
a router, an incoming IP datagram must be forwarded to the next hop, which is typically some neighboring machine.
The router decides the next hop by consulting its routing table. This procedure is called IP forwarding, or table lookup. It
should be noted that forwarding is distinct from computing the routes, which can be called routing and is handled by a
routing algorithm. IP forwarding is sometimes the most time-consuming task for a typical datagram.

[0004] Forwarding of datagrams in recent versions of IP relies on the storage of a set of IP address prefixes, each
prefix being associated with a next hop within the networking system using IP. When an IP datagram arrives at a
router within the networking system, the destination address is matched against the prefixes stored in the forwarding
table associated with the router. The longest prefix that matches the destination IP address is found, and the next-hop
information associated with that prefix is used for forwarding the datagram. This problem is called the Longest Matching
Prefix (LMP) problem. As can be readily appreciated, the efficient storage of prefixes related to addresses is an
important factor in addressing the LMP problem.

[0005] A conventional technique of storing prefixes in the forwarding table associated with a router in a networking
system using IP is known as Classless Inter-Domain Routing (CIDR). See V. Fuller, T. Li, J. Yu, and K. Varadhan,
"Classless inter-domain routing (CIDR): An address assignment and aggregation strategy," RFC-1519, September
1993. The intent of this approach is to reduce the size of tables storing addresses within the Internet. The IP forwarding
approaches prior to CIDR relied on fixing the address formats so that the destination network number could easily be
extracted. Datagrams were forwarded to the next hop associated with each destination network. One forwarding table
entry per network was required on each router. Such a storage requirement became problematic as the number of
networks on the Internet expanded. CIDR reduces the size of these forwarding tables by grouping IP addresses with the
same next-hop information under a single prefix, when such aggregation is possible.

[0006] As an example of the above-mentioned forwarding process, consider Table 1, which contains a set of
prefixes, each associated with a next hop. (The next-hop information will typically consist of the next router's IP address
and the outgoing physical interface.) For example, if an IP datagram has a destination address of "4.123.33.12", the leftmost
eight bits are "00000100". For this address, the LMP from Table 1 is "0000", indicating a next hop of Hq. Another
example is destination address "109.12.12.12", with leftmost eight bits "01101101". This address matches both "01101" and
"011011"; since the latter is the longer prefix, the next hop is H3.
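
The matching rule in the example above can be illustrated with a minimal Python sketch. The table below is a hypothetical stand-in that contains only the three prefixes named in the text; the next-hop labels `hop-a`, `hop-b` and `hop-c` and both function names are assumptions made for illustration. The sketch scans every prefix linearly, so it shows the definition of the LMP and not an efficient lookup structure.

```python
from typing import Dict, Optional


def ip_to_bits(addr: str) -> str:
    """Convert dotted-quad IPv4 text into a 32-character string of '0'/'1'."""
    return "".join(f"{int(octet):08b}" for octet in addr.split("."))


def longest_matching_prefix(table: Dict[str, str], addr: str) -> Optional[str]:
    """Return the longest prefix in `table` that matches `addr`, or None."""
    bits = ip_to_bits(addr)
    best = None
    for prefix in table:
        # A prefix matches when the address starts with the same bits.
        if bits.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return best


# Hypothetical table: prefixes taken from the text, next-hop labels invented.
table = {"0000": "hop-a", "01101": "hop-b", "011011": "hop-c"}

print(longest_matching_prefix(table, "4.123.33.12"))   # 0000
print(longest_matching_prefix(table, "109.12.12.12"))  # 011011
```

The second lookup matches both "01101" and "011011", and the longer of the two is returned, as in the text.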

[0007] A significant amount of earlier research on information storage and retrieval is applicable to the LMP
problem, usually with only minor modifications. In particular, search approaches that rely on the binary representation of
keys, rather than direct comparison of keys, have proven to be popular in this context. Knuth provides an overview of a
variety of such "digital searching" approaches. See D. E. Knuth, The Art of Computer Programming: Volume 3, Sorting
and Searching. Addison Wesley, second ed., 1998.

[0008] A trie structure is a kind of tree structure where branching at any level is determined by only a portion of the
value stored in the nodes in the trie. Fredkin's trie structure is elegant, but suffers from inefficient memory usage,
potentially requiring far more nodes than stored prefixes. See E. Fredkin, "Trie memory," Communications of the ACM, vol. 3,
pp. 490-500, 1960. Morrison's Patricia trie remedies this problem by removing each trie node that is not associated with
a table entry and has only one child. See D. Morrison, "Patricia: practical algorithm to retrieve information coded in
alphanumeric," Journal of the ACM, vol. 15, no. 4, pp. 515-534, October 1968. These two structures, proposed by
Fredkin and Morrison respectively, have influenced much of the recent work on IP forwarding. FIG. 19 shows an example of
a conventional trie structure. FIG. 20 shows an example of a conventional trie structure and an equivalent conventional
Patricia trie structure. In FIGs. 19-20, the darkened nodes store a prefix and the non-darkened nodes do not have any
prefix stored.

[0009] Recent proposals for dealing with IP forwarding have been optimized with different design goals in mind.
Several of these approaches emphasize the speed of the lookup (i.e., search) rather than updates to the tables (i.e.,
insert and delete). See M. Degermark, A. Brodnik, S. Carlsson, and S. Pink, "Small forwarding tables for fast routing
lookups," in Proceedings ACM SIGCOMM'97, pp. 3-14, 1997; B. Lampson, V. Srinivasan, and G. Varghese, "IP
lookups using multiway and multicolumn search," in Proceedings IEEE INFOCOM'98, pp. 1248-1256, 1998; S. Nilsson
and G. Karlsson, "Fast address lookup for Internet routers," in Proceedings of IEEE Broadband Communications 98,
April 1998; H.H.-Y. Tzeng, "Longest prefix search using compressed trees," in GLOBECOM'98, Global Internet Mini
Conference, pp. 88-93, November 1998; and M. Waldvogel, G. Varghese, J. Turner, and B. Plattner, "Scalable high
speed IP routing lookups," in Proceedings ACM SIGCOMM'97, pp. 25-36, 1997.

[0010] The reason for emphasizing the speed of the lookup is that although routing updates are fairly frequent,
routing protocols can take several minutes to accommodate an update; forwarding tables on any particular router do not
need to be changed more than at most once per second for current systems. One therefore envisions the use of some
dynamic routing table structure elsewhere on the router, which periodically updates the forwarding tables.
[0011] The current emphasis on search speed leads to an unbalanced design, one that is out of step with current
and future needs of the Internet. Labovitz et al. point out that Internet core routers typically exchanged between three
and six million updates per day in 1996. See C. Labovitz, G. R. Malan, and F. Jahanian, "Internet routing instability,"
IEEE/ACM Transactions on Networking, vol. 6, no. 5, pp. 515-528, October 1998. As the Internet grows, and as support
for mobility expands, an even greater need for forwarding tables that can be updated efficiently can be expected.
[0012] Recently, there has been a flurry of work on the IP forwarding problem. In the present discussion, software
approaches are emphasized. It should be noted that many software approaches can be implemented efficiently in
hardware. One such hardware implementation of the present invention has also been provided. A comparison of several of
these approaches can be found in the work of Filippi et al. See E. Filippi, V. Innocenti, and V. Vercellone, "Address
lookup solutions for gigabit switch/router," in GLOBECOM'98, Global Internet Mini Conference, pp. 82-87, November
1998.

[0013] Degermark et al. describe an approach optimized for execution on an off-the-shelf processor. To provide
efficient operation, Degermark et al. keep the table data small (so the entire forwarding table can fit into cache) while
simultaneously trying to minimize the number of memory accesses required to search the table. See M. Degermark, A.
Brodnik, S. Carlsson, and S. Pink, "Small forwarding tables for fast routing lookups," in Proceedings ACM
SIGCOMM'97, pp. 3-14, 1997. This way of storing reduces the number of memory accesses (at the expense of greater
memory utilization) by searching the prefix tree (effectively a trie) only on three separate levels, as opposed to
performing one memory access for each of the 32 trie levels. In the prefix tree structure, only certain bit patterns are possible at
some levels of the search; Degermark et al. are therefore able to utilize a data compression technique. Though the gains in
their example are somewhat limited (they can effectively store 16 bits of a bit vector with only 10 bits, and this is only for
one component of the overall data structure), their compression approach is a notable one. This way of storing
addresses is not designed to support efficient updates.
[0014] Waldvogel et al. describe another approach for searching a trie structure. See M. Waldvogel, G. Varghese,
J. Turner, and B. Plattner, "Scalable high speed IP routing lookups," in Proceedings ACM SIGCOMM'97, pp. 25-36, 1997.
Rather than starting at the root of the trie and working down, their approach starts at the middle level, and works up or
down depending on the information it finds there. A level is searched quickly through hashing. Only nodes that would
be in the trie are stored. If a level is searched and no match is found, one knows that a smaller prefix is the only
possibility. On the other hand, a hit in the hash table could either mean the longest matching prefix has been found, or that
one should look even deeper for a longer prefix. The key idea, however, is that not every level of the trie needs to be
searched. The approach is quite scalable, requiring only lg b level searches for b-bit prefixes. The selection of good
hash functions (ones that can be calculated quickly and can evenly distribute the nodes) is not discussed herein, but is
an important issue. Also, the structure makes heavy use of precomputation and does not support efficient updates.
However, this approach is used on a specific IP router designed by Partridge et al. See C. Partridge, P. P. Carvey, E.
Burgess, I. Castineyra, T. Clarke, L. Graham, M. Hathaway, P. Herman, A. King, S. Kohalmi, T. Ma, J. Mcallen, T.
Mendez, W. C. Milliken, R. Pettyjohn, J. Rokosz, J. Seeger, M. Sollins, S. Storch, B. Tober, G. D. Troxel, D. Waitzman, and
S. Winterble, "A 50-Gb/s IP router," IEEE/ACM Transactions on Networking, vol. 6, no. 3, pp. 237-248, June 1998.
[0015] Nilsson and Karlsson utilize a variation of a Patricia trie that replaces i complete levels of a binary trie with a
single node of degree 2^i. See S. Nilsson and G. Karlsson, "Fast address lookup for Internet routers," in Proceedings
of IEEE Broadband Communications 98, April 1998. This approach results in a very dense table, but again is not
designed to support efficient updates. Another approach that compresses the trie, and thereby reduces the average
number of memory accesses per search, is described by Tzeng. See H.H.-Y. Tzeng, "Longest prefix search using
compressed trees," in GLOBECOM'98, Global Internet Mini Conference, pp. 88-93, November 1998.
[0016] Lampson et al. propose a substantially different approach to the LMP problem by viewing it as a variation of
binary search. See B. Lampson, V. Srinivasan, and G. Varghese, "IP lookups using multiway and multicolumn search,"
in Proceedings IEEE INFOCOM'98, pp. 1248-1256, 1998. Since a prefix represents a range of IP addresses, the prefix
can be represented by two IP addresses: the smallest and the largest in the range. By sorting the (at most) 2p boundary
addresses for p prefixes, we essentially define buckets of addresses where each address in the bucket has the same
next hop. The approach is quite memory efficient; however, insertion and deletion are relatively inefficient operations.
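
The range view of a prefix described above can be sketched in Python. This is only an illustration of how a prefix maps to its smallest and largest address and of how the boundary addresses are collected; it is not the search procedure of Lampson et al. The function names and the use of bit strings are assumptions, and 8-bit toy addresses are used so that the numbers stay readable.

```python
from typing import List, Tuple


def prefix_range(prefix: str, width: int = 32) -> Tuple[int, int]:
    """Return the smallest and largest address covered by a bit-string prefix."""
    low = int(prefix.ljust(width, "0"), 2)   # pad with 0s: smallest address
    high = int(prefix.ljust(width, "1"), 2)  # pad with 1s: largest address
    return low, high


def boundaries(prefixes: List[str], width: int = 32) -> List[int]:
    """Sorted, de-duplicated boundary addresses (at most 2p for p prefixes)."""
    points = set()
    for p in prefixes:
        points.update(prefix_range(p, width))
    return sorted(points)


print(prefix_range("011011", width=8))                    # (108, 111)
print(boundaries(["0000", "01101", "011011"], width=8))   # [0, 15, 104, 108, 111]
```

Here three prefixes yield five distinct boundaries rather than six, because "01101" and "011011" share the upper endpoint 111.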
[0017] Srinivasan and Varghese exploit the well-known technique of prefix expansion, which will reduce the number
of memory accesses in a typical search at the cost of potentially increasing memory requirements and making updates
of the forwarding table more difficult. See V. Srinivasan and G. Varghese, "Faster IP lookups using controlled prefix
expansion," in ACM SIGMETRICS'98, pp. 1-10, June 1998. The paper's main contribution is the description of a formal
approach based on dynamic programming to provide a way of searching that minimizes memory utilization. A similar
scheme based on prefix expansion is described by Gupta et al. See P. Gupta, S. Lin, and N. McKeown, "Routing
lookups in hardware at memory access speeds," in Proceedings IEEE INFOCOM'98, pp. 1240-1247, 1998. Their
approach, however, focuses on a proposed hardware implementation rather than optimality.
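
The basic step of prefix expansion mentioned above, replacing one short prefix by all of its extensions of a chosen longer length, can be sketched as follows. This is a minimal illustration of the expansion step only, not the dynamic programming scheme of Srinivasan and Varghese nor the hardware design of Gupta et al.; the function name `expand_prefix` is an assumption.

```python
from itertools import product
from typing import List


def expand_prefix(prefix: str, target_len: int) -> List[str]:
    """Replace `prefix` by every bit string of length `target_len` that extends it."""
    if len(prefix) > target_len:
        raise ValueError("prefix is already longer than the target length")
    pad = target_len - len(prefix)
    # product("01", repeat=pad) enumerates every possible tail of `pad` bits.
    return [prefix + "".join(tail) for tail in product("01", repeat=pad)]


print(expand_prefix("01", 4))  # ['0100', '0101', '0110', '0111']
```

One 2-bit prefix becomes four 4-bit prefixes, which shows how fewer distinct prefix lengths are paid for with additional table entries.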

[0018] The conventional approaches that have been discussed above are designed primarily to optimize lookup
speed. In contrast to these approaches, Doeringer et al.'s work on DP-Tries attempts to optimize forwarding table
update speed as well as lookup speed. See W. Doeringer, G. Karjoth, and M. Nassehi, "Routing on longest-matching
prefixes," IEEE/ACM Transactions on Networking, vol. 4, no. 1, pp. 86-97, February 1996. Their approach is a
variation of the Patricia trie, with efficient insert and delete algorithms defined, as well as search. It is worth noting that more
dynamic structures such as DP-Tries can be used to complement an approach that emphasizes lookup speed, by
maintaining an up-to-date routing table that is accessed by the forwarding tables periodically.

[0019] To accommodate the ever expanding needs of applications, specifically networking applications, a structure
and method for storing prefixes related to addresses is required that at least meets the following criteria:

• Efficient and scalable memory usage.

• The structure should support efficient and simple insert, delete, and search operations.

• The structure should support a pipelined hardware implementation.

II. SUMMARY OF THE INVENTION 

[0020] To solve the problems in the prior art it is an objective of the present invention to provide a way of storing
prefixes related to addresses. It is a further objective of the present invention to provide a way of storing prefixes related to
network addresses in a networking system. It is another objective of the present invention to provide a networking
system that stores prefixes related to addresses in an efficient manner. It is yet another objective of the present invention
to provide a computer program product that enables a computer associated with routers in a networking system to store
prefixes related to addresses in an efficient manner.
[0021] To meet the objectives of the present invention there is provided a method of storing a set of prefixes related
to a set of addresses, said method comprising storing the prefixes in a binary trie fashion wherein each node in said
binary trie is associated with at least one of said prefixes and no node in said binary trie is empty.
[0022] Preferably a first prefix is inserted into an empty trie by allocating a root node and placing said prefix in the
root node.

[0023] Preferably a prefix other than a first prefix, comprising k bits, with a representation $b_0, b_1, \ldots, b_{k-1}$, wherein k
is an integer greater than 0, is inserted using a process comprising: designating a root node of said trie as a current
node, $b_n = b_0$, and the prefix as the current prefix; terminating insertion if the current node has the current prefix
already stored in it; examining the current node's left child if $b_n = 0$ and examining the current node's right child if $b_n = 1$;
allocating a new node and placing the current prefix if one of left child and right child does not exist, and designating
said new node as the current node; assigning n=n+1; repeating until n=k; replacing a previously stored prefix in the
current node with the current prefix and designating the previously stored prefix as the current prefix and repeating the
steps.

[0024] Preferably the trie is searched for an LMP of an address comprising k bits, with a representation $b_0, b_1, \ldots, b_{k-1}$,
wherein k is an integer greater than 0, using a process comprising: designating a root node as current node as well as
an LMP node if the root node has a matching prefix, and $b_n = b_0$; designating the current node as an LMP node and
carrying said LMP node lower if the current node has a matching prefix and if the matching prefix is longer than the LMP
node; designating the current node's left child as the current node if $b_n = 0$ and designating the current node's right
child as the current node if $b_n = 1$; n=n+1; repeating the steps until the current node is at a lowest level of the trie; and
selecting a prefix corresponding to said lowest level as an LMP if said prefix is a match.

[0025] Preferably, a prefix corresponding to an address is deleted in said trie using a process comprising: searching
for a matching node corresponding to said prefix; deleting the matching node if the matching node is a leaf node and
terminating the process; deleting said matching node and moving up one of said matching node's children if said
matching node is not a leaf node and deleting said one of said matching node's children; and repeating the steps until a leaf
node is deleted.
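
The deletion steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: prefixes are strings of '0'/'1', the `Node` class and the helper names are hypothetical, and the rule of preferring the left child when both children exist is an arbitrary choice, since the text only requires that one of the children be moved up. The example trie is an assumed layout in which every prefix lies on the path spelled by its own bits.

```python
from typing import List, Optional


class Node:
    def __init__(self, prefix: str, left: Optional["Node"] = None,
                 right: Optional["Node"] = None) -> None:
        self.prefix = prefix
        self.child = [left, right]  # child[0] = 0-child, child[1] = 1-child


def delete(root: Optional[Node], prefix: str) -> Optional[Node]:
    """Delete `prefix` if stored; return the root (None if the trie is now empty)."""
    parent, node, level, side = None, root, 0, 0
    # Locate the node holding the prefix by following the prefix's own bits.
    while node is not None and node.prefix != prefix:
        if level == len(prefix):
            return root  # a k-bit prefix cannot be stored below level k
        side = int(prefix[level])
        parent, node = node, node.child[side]
        level += 1
    if node is None:
        return root  # prefix not stored
    # Pull a child's prefix up repeatedly until the hole reaches a leaf.
    while node.child[0] is not None or node.child[1] is not None:
        side = 0 if node.child[0] is not None else 1  # assumption: prefer left
        node.prefix = node.child[side].prefix
        parent, node = node, node.child[side]
    if parent is None:
        return None  # the root was the only node
    parent.child[side] = None  # unlink the leaf
    return root


def stored(node: Optional[Node]) -> List[str]:
    """Pre-order list of the prefixes held in the trie."""
    if node is None:
        return []
    return [node.prefix] + stored(node.child[0]) + stored(node.child[1])


root = Node("10011",
            Node("01100", Node("00011"), Node("0100", Node("010"), Node("011110"))),
            Node("1111", Node("10100")))
root = delete(root, "01100")
print(stored(root))
```

Moving a child's prefix up keeps every prefix on the path spelled by its own bits, because a child's path always extends its parent's path.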

[0026] Preferably the trie is balanced for minimizing a depth in a worst-case search. 

[0027] Another aspect of the present invention is a method of converting a simple trie with stored addresses into a
depth-optimal sub-trie that has all nodes representing addresses, said method comprising: finding a lowest level of said
simple trie that has a full node and designating said lowest level as i, wherein i is an integer; examining each node at a
level corresponding to i-1; moving up a prefix if there is an empty node at level i-1 from a bottom of the deeper subtrie
of the empty node; and continuing said merging until the root node is reached.

[0028] Yet another aspect of the present invention is a method of converting a simple trie, with stored addresses
and known probabilities of visiting each node in said simple trie, into a search-optimal trie with a minimum number of
expected steps per search, said method using dynamic programming and said method comprising: calculating an array
$A_\alpha$ for each node $\alpha$ using a bottom-up process such that $A_\alpha[i]$ holds a least expected number of search steps assuming
$i$ nodes are promoted out of a sub-trie with $\alpha$ as a root, wherein,

$$A_\alpha[i] = f(A_\beta, A_\gamma, P_\beta, P_\gamma)$$

$\beta$ and $\gamma$ are the left and right children of $\alpha$;

$P_\beta$ and $P_\gamma$ represent the probability that $\beta$ and $\gamma$ are visited during a search assuming that $\alpha$ has been visited;
associating with each $A_\alpha[i]$ a number of prefixes that must be promoted from $\beta$ and $\gamma$ to generate optimal subtries
associated with each $A_\alpha[i]$; and working recursively top-down from the root to issue requests to child nodes to promote
prefixes up, the root node requesting 1 prefix if the root node does not hold a prefix, the root node requesting 0
prefixes if the root node holds a prefix, said requests being based on the array $A$ and the associated numbers.

[0029] Preferably the stored prefixes are related to Internet addresses and said trie is located in an IP router.
[0030] Still another aspect of the present invention is a networking system comprising a plurality of routers, each
router having an address storage, wherein in each address storage a set of prefixes related to network addresses
corresponding to the network system are stored in a form of a binary trie, said binary trie comprising a plurality of nodes
wherein each node is associated with a prefix of at least one of said network addresses and no node in said binary trie
is empty.

[0031] Preferably the binary trie is balanced for minimizing a depth in a worst-case search.

[0032] Still another aspect of the present invention is a computer program product including a computer-readable
medium, said program enabling one or more computers associated with a networking system to store a set of
addresses stored in each router within said networking system in a binary trie fashion wherein each node in said binary
trie is associated with a prefix of at least one of said addresses and no node in said binary trie is empty.

[0033] Preferably, in the computer program product of claim 14, said trie is balanced for minimizing a depth in
a worst-case search.

[0034] Still another aspect of the present invention is a system for storing a set of addresses in a binary trie fashion
wherein each node in said binary trie is associated with a prefix of at least one of said addresses, wherein said system
comprises a pipeline, said pipeline further comprising a plurality of stages, each stage from said plurality of stages
corresponding to a level in said binary trie, said stage consisting essentially of a memory component, a bank of latches
and a simple logic, said bank of latches storing a prefix, a destination IP address, a pointer pointing to an appropriate
node, an instruction that indicates the task of the corresponding stage, and a state containing information about a state
of the instruction.

[0035] Preferably, each stage of said pipeline comprises latches holding input and output information, memory con- 
taining information corresponding to nodes at a level, a stack containing pointers to unused node addresses and com- 
parators. 

III. LIST OF FIGURES

[0036] The above objectives and advantages of the present invention will become more apparent by describing in
detail preferred embodiments thereof with reference to the attached drawings in which:

FIG. 1 depicts prefix distribution, node distribution, and depth-optimal distribution for mae-west prefixes.

FIG. 2 shows a preferred embodiment of the present invention after the insertion of sequence (10011, 01100,
1111, 0100, 011110, 10100, 010, 00011).

FIG. 3 shows the preferred embodiment after the insertion of prefix 01.

FIG. 4 shows the preferred embodiment after the deletion of prefix 01100.

FIG. 5 shows the preferred embodiment with the same prefixes as FIG. 2 but with a different insertion sequence
(0100, 01, 10011, 000111.010, 10100, 1111, 011110).

FIG. 6 shows two possible bonsai structures.

FIG. 7 shows the merging of two depth-optimal subtries.

FIG. 8 shows a search-optimal subtrie.

FIG. 9 shows depth for random, depth-optimal, and the search-optimal subtrie.

FIG. 10 shows average node level for random, depth-optimal, and search-optimal bonsai.

FIG. 11 shows average comparisons per search for random, depth-optimal, and search-optimal bonsai.

FIG. 12 shows a distribution for the first byte of the prefixes for the mae-east forwarding table.

FIG. 13 shows a distribution for the first byte of the destination IP addresses for the fix-west trace.

FIG. 14 illustrates an example of a pipeline implementation of bonsai.

FIG. 15 shows an example of a search in stage i using the pipeline implementation of FIG. 14.

FIG. 16 shows an example of an insert in stage i using the pipeline implementation of FIG. 14.

FIG. 17 shows an example of a delete in stage i using the pipeline implementation of FIG. 14.

FIG. 18 shows an embodiment of a network system with routers storing prefixes related to addresses in an efficient
manner according to the present invention.

FIG. 19 shows an example of a conventional trie structure.

FIG. 20 shows an example of a conventional trie structure and an equivalent conventional Patricia trie structure.
[0037] In terms of design objectives, the approach of the present invention has much in common with Doeringer et
al.'s DP-Tries, but there are several notable differences:

• The algorithms for search, insert, and delete are considerably simpler than those for the DP-Tries.

• Nodes in the present invention are relatively simple in comparison to the DP-Trie nodes. Each DP-Trie node
requires three pointers to other nodes, two prefixes, and one index value that must be able to represent integers in
the range of 0 up to the number of bits in an IP address. The nodes in the present invention contain one prefix and
two pointers to other nodes.

• DP-Tries in general have more nodes than the number of prefixes. These "overhead" nodes store information
required to search the table. The present invention has no overhead nodes.

[0038] The structure of a DP-Trie depends only on the prefixes in the table, not upon the ordering of the insertions
and deletions of the prefixes. The structure of the trie used in the present invention is, in general, dependent on the
insert and delete order.

[0039] The advantages of the present approach include the following: 

• Memory usage is efficient and scalable. The bonsai uses only a single node for each prefix, with each node
comprising two node pointers and a prefix pointer. It is straightforward to store the nodes and the prefixes in arrays;
therefore the pointer sizes can be restricted to lg p bits where p is the number of prefixes stored.

• The structure supports efficient and simple insert, delete, and search operations. If b is the number of bits in an IP
address, the algorithms will require O(b) time.

• The way of storing prefixes according to the present invention is dependent on the sequence of insert operations.
The present invention provides different optimality criteria. The first is a greedy algorithm that calculates the binary
trie with minimum overall depth. The second is a dynamic programming approach that derives the bonsai trie with
the minimal number of expected steps per search. This search-optimal bonsai approach can assume an arbitrary
distribution of IP destination addresses.

• The present invention is particularly well suited for a pipelined hardware implementation. Throughput can be as
high as one search (that is, one longest prefix match) per memory-access time. Inserts and deletes can be
accomplished with no more than two clock cycle stalls in the pipeline.

[0040] Intuitively, the present invention is a variant of the trie approach. It eliminates nodes that are not associated
with table entries by moving the prefixes in a trie upwards, until all nodes are associated with a prefix. Such an approach
has two positive effects. First, it reduces memory usage. Second, it makes the trie more "shallow", potentially allowing
for fewer memory lookups per search. As an example, FIG. 1 shows the trie prefix distribution, the trie node distribution,
and the prefix/node distribution for the depth-optimal bonsai (discussed further in Section IVC.1) and for the mae-east
forwarding table (discussed further in Section IVD). The levels of the trie are labeled from 0 (the root) to 32 (the
maximum length of a prefix). The present invention greatly reduces the number of nodes needed; further, prefixes are moved
further up in the trie, reducing the number of steps per search.

IVA. Bonsai 

[0041] The preferred embodiment of the present invention, described herein, is named bonsai. Bonsai stores
prefixes related to Internet addresses. The bonsai is a binary trie, where each node has an associated prefix. Insert,
search, and delete operations, as well as certain implementation issues, are discussed herein in detail. The bonsai has
certain invariants which are presented here. Following is a proof showing that these invariants hold under the insert and
delete operations.

Lemma 1 (Bonsai invariants) 

[0042] A bonsai has the following properties: 

1. The bonsai is a "packed" trie, in the sense that it contains only nodes that represent a routing-table entry.

2. All potential matching prefixes for an IP address can be found by descending the bonsai in typical trie fashion,
i.e., by using the ith bit of the IP address to choose the direction taken at level i of the bonsai.
IVA.1 Insert 

[0043] When the first prefix is inserted into an empty trie, a root node is allocated and the prefix pointer of the node
is set appropriately. Subsequent insertions follow their way down the trie structure in the usual way, until the first free
location is found. If a k-bit prefix with a binary representation of $b_0, b_1, \ldots, b_{k-1}$ is inserted, the algorithm starts at the root
node. If $b_0 = 0$, the root node's left child is examined; otherwise the right child is examined. If no left (right) child exists,
a node is allocated at that position and the prefix is placed there. If a left (right) child already exists, then bit $b_1$ is
examined in the context of the child node. Only one copy of a prefix is allowed in the trie. If a duplicate copy is inserted, it will
be found during the descent down the trie, and the current insertion will be stopped without modifying the trie. FIG. 2
shows the status of the bonsai after a sequence of prefixes has been inserted.

[0044] Not all prefixes will fall through the trie and become a leaf node, however. For example, consider insertion of the prefix 01 into the trie shown in FIG. 2. After following the 0-child of the node holding prefix 10011 and the 1-child of the node holding prefix 01100, there is no way to go further down the trie. When a prefix x falls through the trie as far as it can and finds a node already there holding prefix y, prefix y is dislodged and allowed to fall further down the trie, just as if y were being inserted. Unless y equals x (in which case the insertion terminates), y will be longer than x and will therefore be able to fall further down the trie. Note that an insertion may cause numerous prefixes to be dislodged, but the procedure will require in the worst case only O(d) operations, where d is the depth of the trie. For example, FIG. 3 shows the state of the example trie after the 01 prefix has been inserted. The 01 prefix dislodges the 0100 prefix, which falls two levels down in the trie, where it generates a leaf node.
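
The insertion procedure, including the dislodging step, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: prefixes are modeled as bit strings, and the names `Node`, `Bonsai`, and `insert` are hypothetical. Only the prefixes named in the text are inserted; the remaining prefixes of FIG. 2 are omitted.

```python
class Node:
    def __init__(self, prefix):
        self.prefix = prefix        # bit string such as "0100"
        self.child = [None, None]   # 0-child and 1-child


class Bonsai:
    def __init__(self):
        self.root = None

    def insert(self, prefix):
        """Insert a prefix; return False if it was already present."""
        if self.root is None:
            self.root = Node(prefix)
            return True
        node, level = self.root, 0
        while True:
            if node.prefix == prefix:
                return False        # duplicate found during the descent
            if level == len(prefix):
                # x can fall no further: it takes this node and the
                # resident prefix y is dislodged and keeps falling
                node.prefix, prefix = prefix, node.prefix
            bit = int(prefix[level])
            if node.child[bit] is None:
                node.child[bit] = Node(prefix)   # first free location
                return True
            node = node.child[bit]
            level += 1


trie = Bonsai()
for p in ["10011", "01100", "0100", "010"]:
    trie.insert(p)
trie.insert("01")   # dislodges 0100, which falls two levels and becomes a leaf
```

Each loop iteration moves one level down, so the sketch performs at most one step per level of the trie, in line with the O(d) bound stated above.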

IVA.2 Search

[0045] Searching the bonsai given an IP address is relatively intuitive. One descends the trie in the usual way, as noted above in the section describing insertion. At each step of the descent, a comparison is made to see if the IP address is a match for the stored prefix. If so, and if that prefix is longer than any previously found match, a pointer to the node is carried along as lower levels of the trie are searched. An IP address may match several prefixes as it descends the trie, but all potential matches will be in its path.

[0046] One consequence of this approach is that at each level of the trie, a comparison against the stored prefix is required. Such comparisons are not necessary for a pure trie approach, and will add a constant factor cost.

[0047] Consider searching the bonsai in FIG. 2 for the LMP of 01000000..., an IP address. At the root node there is no match with prefix 10011. The 0-child node is visited, but there is no match with prefix 01100. That node's 1-child is visited, and there is a match with prefix 0100, so this prefix is remembered. Finally, that node's 0-child is visited, and again a match is found with prefix 010. However, this new match is shorter than the previous match. Since one can go no further down the trie, 0100 must be the LMP.
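
The search walk described above can be sketched as follows. This is an illustrative sketch only: addresses and prefixes are modeled as bit strings, the names `Node` and `longest_match` are hypothetical, and the trie built here contains only the four FIG. 2 prefixes named in the text.

```python
class Node:
    def __init__(self, prefix, zero=None, one=None):
        self.prefix = prefix
        self.child = [zero, one]


def longest_match(root, addr):
    """Return the longest stored prefix matching addr, or None."""
    best = None
    node, level = root, 0
    while node is not None:
        p = node.prefix
        # one comparison per level: keep the match only if it is longer
        if addr.startswith(p) and (best is None or len(p) > len(best)):
            best = p
        if level == len(addr):
            break
        node = node.child[int(addr[level])]   # bit i selects the child at level i
        level += 1
    return best


# Fragment of FIG. 2: 10011 at the root, 01100 below it, then 0100, then 010.
fig2_fragment = Node("10011",
                     zero=Node("01100",
                               one=Node("0100",
                                        zero=Node("010"))))
result = longest_match(fig2_fragment, "01000000")   # "0100"
```

The sketch carries the best prefix itself rather than a pointer to its node; in a router the node would also hold the forwarding information.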

IVA.3 Delete

[0048] As with insert and search, the delete operation involves a traversal down the trie, searching for matches against the prefixes. If the prefix to be deleted is resident at a leaf node, the prefix is deleted and the node is removed from the trie, requiring an update of one of its parent node's child pointers. If the prefix to be deleted is associated with a node that is not a leaf, however, care must be taken to maintain the trie structure. The key insight is that any prefix in the node's subtrie can replace the deleted prefix. Though there are many ways to select a replacement, one that will be easily pipelined is chosen in the present context. The prefix associated with a child node is moved up, and replaced with the prefix of one of its children, etc., until eventually a leaf node is reached. Then the leaf node (whose prefix has moved up to the parent node) can be deleted. In such cases, the prefixes can be viewed as "percolating" up the trie. Note that it is always a leaf node that is deleted.

[0049] In the case where there are two children for a node, it will be possible to choose either the left or right child's prefix to percolate up. (If a node has only one child, there is no choice.) When there is a choice, it is possible to use a static approach (e.g., the 0-child is preferred), or a more dynamic one (e.g., random selection).

[0050] For example, consider deleting prefix 01100 from FIG. 3. The process and resulting trie are shown in FIG. 4. The process starts at the root node, where no match is found with prefix 10011. The 0-child of the root node is visited next, and a matching prefix to be deleted is found. Either the 0-child or the 1-child prefix can be percolated. Assume that the 1-child is preferred and 01 percolates up. Thereafter, assume that the 0-child is preferred and 010 percolates up. This node has only one child, so 0100 percolates up. Since the leaf node has been reached, it is deleted.
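
A sketch of delete with percolation is given below. It assumes a static child preference passed as the hypothetical parameter `prefer`; the names `Node` and `delete` are likewise illustrative. The trie is the fragment of FIG. 3 named in the text, so at each step in this fragment only one child exists and the preference is never actually exercised.

```python
class Node:
    def __init__(self, prefix, zero=None, one=None):
        self.prefix = prefix
        self.child = [zero, one]


def delete(root, prefix, prefer=0):
    """Delete prefix from the bonsai; return the (possibly new) root."""
    parent, side, node, level = None, None, root, 0
    # Descend until the prefix is found or the path ends.
    while node is not None and node.prefix != prefix:
        if level >= len(prefix):
            return root                       # prefix is not stored
        side = int(prefix[level])
        parent, node = node, node.child[side]
        level += 1
    if node is None:
        return root                           # prefix is not stored
    # Percolate child prefixes up until a leaf is reached.
    while node.child[0] is not None or node.child[1] is not None:
        side = prefer if node.child[prefer] is not None else 1 - prefer
        node.prefix = node.child[side].prefix
        parent, node = node, node.child[side]
    if parent is None:
        return None                           # the root was the only node
    parent.child[side] = None                 # it is always a leaf that is removed
    return root


# Fragment of FIG. 3 (other prefixes omitted).
fig3_fragment = Node("10011",
                     zero=Node("01100",
                               one=Node("01",
                                        zero=Node("010",
                                                  zero=Node("0100")))))
fig3_fragment = delete(fig3_fragment, "01100", prefer=1)
```

Moving a child's prefix into its parent keeps the trie valid because every prefix stored below a node shares that node's path bits.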

IV.B Optimal Bonsai Tries

[0051] One consequence of the bonsai operations is that the structure of the trie is dependent (in general) upon the order of the insert and delete operations. For example, the bonsai in FIG. 5 contains the same prefixes as the bonsai in FIG. 2. However, the bonsai in FIG. 5 has a smaller average depth for the prefixes. It is therefore possible to manipulate the trie to optimize some performance metric. For example, it may be desirable to balance the trie as much as possible, so as to minimize the depth of the worst-case search. Or it may be desirable to minimize the depth of the average-case search.

[0052] It is worth noting that minimizing the worst-case and the average-case search are conflicting criteria. For example, consider the two small bonsai structures shown in FIG. 6. Assuming a uniform probability for all IP addresses, the expected number of comparisons to search trie (a) will be 2, since all searches will require exactly 2 comparisons. For trie (b), the expected number is (50%)(1) + (25%)(2) + (25%)(3) = 1.75.
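
The two expected values can be checked with a short sketch. FIG. 6 is not reproduced in this text, so the shapes below are assumptions chosen to be consistent with the stated numbers: trie (a) is taken to be a root with two children, and trie (b) a chain of three nodes. The function name `expected_comparisons` is hypothetical.

```python
def expected_comparisons(node):
    """Expected comparisons per search under uniformly random address bits.

    A node is None or a pair (zero_child, one_child). Every node that is
    reached costs one comparison, and each child is taken with probability 1/2.
    """
    if node is None:
        return 0.0
    zero, one = node
    return 1.0 + 0.5 * expected_comparisons(zero) + 0.5 * expected_comparisons(one)


LEAF = (None, None)
trie_a = (LEAF, LEAF)            # assumed shape: balanced, depth 1
trie_b = ((LEAF, None), None)    # assumed shape: chain, depth 2

print(expected_comparisons(trie_a))   # 2.0
print(expected_comparisons(trie_b))   # 1.75
```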

[0053] Assuming a uniform distribution, the unbalanced trie has better average-case search performance than the balanced trie. Of course, the uniform distribution assumption will not be valid in real routers. However, whenever the probability distribution is known or can be estimated (for example, by tallying each time a node is accessed during a search), it will be possible to adjust the trie to optimize average-case behavior. Though calculation of an optimal bonsai may be too time-consuming after each insert or delete operation, it may be reasonable to periodically restructure the bonsai so that it better meets the optimization criterion.

[0054] In the following sections, preferred embodiments of methods for calculating two different types of optimal bonsai are provided. The first is a greedy algorithm that calculates the bonsai trie with minimum overall depth. The second is a dynamic programming approach that derives the bonsai trie with the minimal expected number of search steps, based on an arbitrary distribution of destination IP addresses.

[0055] Terminology related to the optimization methods is discussed herein. An empty node is a trie node that does not represent a routing-table entry. A full node does represent an entry. A subtrie consists of levels of nodes, where nodes of level i are i hops from the root of the subtrie. The root of the subtrie is the only node at level 0. Greek letters are used to represent nodes. The level of node α is labeled $d_{\alpha}$. The root node that this level is relative to should be clear from the context. If the root node of some subtrie is labeled α, α may also be used to represent the entire subtrie rooted at α; again, the meaning should be clear from the context. For any subtrie rooted at α, let $w_i^{\alpha}$ represent the total number of prefixes at or below level i. $w_i^{\alpha}$ is called the weight of level i of the subtrie rooted at α. For example, if α is the root node of the full trie in FIG. 2, then $w_0^{\alpha} = 8$, $w_1^{\alpha} = 7$, $w_2^{\alpha} = 5$, $w_3^{\alpha} = 2$, and $w_i^{\alpha} = 0$ for all $i \geq 4$. Let β be the node with prefix 0100; then $w_0^{\beta} = 3$, $w_1^{\beta} = 2$, and $w_i^{\beta} = 0$ for all $i \geq 2$. When the subtrie is obvious from context, $w_i$ is also used to represent the level weight. Finally, the depth of a subtrie is the level of its deepest node.
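
As an illustration of the level-weight definition, the sketch below computes the weights for a trie shape. FIG. 2 is not reproduced in this text, so the shape used here is an assumption: it has 1, 2, 3, and 2 nodes on levels 0 through 3, which is what the example weights above imply, with the node holding 0100 on level 2 having two children. The function name `level_weights` is hypothetical.

```python
from collections import Counter


def level_weights(root):
    """Return [w_0, w_1, ...], where w_i counts the nodes at or below level i.

    A node is None or a pair (zero_child, one_child). In a bonsai every
    node holds exactly one prefix, so counting nodes counts prefixes.
    """
    per_level = Counter()

    def walk(node, level):
        if node is None:
            return
        per_level[level] += 1
        walk(node[0], level + 1)
        walk(node[1], level + 1)

    walk(root, 0)
    weights, running = [], 0
    for level in range(max(per_level, default=-1), -1, -1):
        running += per_level[level]          # accumulate from the deepest level up
        weights.append(running)
    return weights[::-1]


LEAF = (None, None)
beta = (LEAF, LEAF)                          # node holding 0100 and its two children
alpha = ((LEAF, beta), (LEAF, None))         # assumed shape for the full trie

print(level_weights(alpha))   # [8, 7, 5, 2]
print(level_weights(beta))    # [3, 2]
```

Weights beyond the last list entry are zero, matching the statement that $w_i = 0$ below the deepest level.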

IVB.1 Depth-Optimal Bonsai

[0056] A greedy algorithm is described herein that starts with a simple trie, compresses it by removing all nodes that do not represent a routing-table entry, and creates a bonsai with minimum depth and minimum average node (prefix) level. This preferred embodiment is called a depth-optimal bonsai. A depth-optimal subtrie is a subtrie such that no other subtrie with the same set of prefixes has a smaller $w_i$, for any i.

[0057] The algorithm works from the bottom up. From the basic trie structure for the routing table, including both empty and full nodes, depth-optimal subtries are recursively merged up. The lowest level of the trie where there is a full node is found. This is called level i. (All nodes found at this level will necessarily be full.) Each node of the trie at level i - 1 is examined. (For a node to exist at any level in the trie, it must either be full or have at least one full descendant.) If a node at level i - 1 is full, no merging of subtries is possible and therefore no action is taken. If the node is empty, this implies that a prefix can be moved up (promoted) from one of its subtries, resulting in a depth-optimal subtrie rooted at level i - 1. An arbitrary (full) node from the lowest level of the deeper subtrie is chosen to be promoted, as shown in FIG. 7. If both subtries have the same depth, a node is arbitrarily selected from the lowest level of either subtrie. This merging process continues up the levels of the trie until the root node is reached.
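
The greedy merge can be sketched as a post-order traversal, which handles each node only after both of its subtries are finished and so visits the nodes in an order compatible with the level-by-level description. This is an unoptimized illustration: `depth` is recomputed on demand, ties are broken toward the 0-child, and all names (`Node`, `build_simple_trie`, `pop_deepest`, `depth_optimal`) are hypothetical.

```python
class Node:
    def __init__(self):
        self.prefix = None            # None marks an empty node
        self.child = [None, None]


def build_simple_trie(prefixes):
    root = Node()
    for p in prefixes:
        node = root
        for bit in p:
            i = int(bit)
            if node.child[i] is None:
                node.child[i] = Node()
            node = node.child[i]
        node.prefix = p               # full node at level len(p)
    return root


def depth(node):
    if node is None:
        return -1
    return 1 + max(depth(node.child[0]), depth(node.child[1]))


def pop_deepest(node):
    """Remove a deepest leaf of the subtrie; return (prefix, remaining subtrie)."""
    if node.child[0] is None and node.child[1] is None:
        return node.prefix, None
    side = 0 if depth(node.child[0]) >= depth(node.child[1]) else 1
    prefix, node.child[side] = pop_deepest(node.child[side])
    return prefix, node


def depth_optimal(node):
    """Turn the simple trie rooted at node into a bonsai, bottom-up."""
    if node is None:
        return None
    node.child = [depth_optimal(c) for c in node.child]
    if node.prefix is None:
        if node.child[0] is None and node.child[1] is None:
            return None               # only happens for an empty prefix set
        # empty node: promote a prefix from the lowest level of the deeper subtrie
        side = 0 if depth(node.child[0]) >= depth(node.child[1]) else 1
        node.prefix, node.child[side] = pop_deepest(node.child[side])
    return node


root = depth_optimal(build_simple_trie(["0100", "0101", "1"]))
print(depth(root))   # 1
```

For the three prefixes in the usage example, the simple trie has depth 4 while the resulting bonsai has depth 1, the minimum possible for three nodes.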

Lemma 2 (Depth-optimal subtrie invariants)

[0058] Using the algorithm described above, each depth-optimal subtrie has the following properties:

1. All prefixes in the subtrie have a common substring represented by the position of the subtrie root node within the trie, and all prefixes that share the substring are in the subtrie.

2. The subtrie is depth-optimal in the sense that it is impossible to rearrange the prefixes such that the level weights, $w_i$, can be reduced.

[0059] Proof: Induction on the levels of the trie is used, beginning with level i, the level of the deepest (full) node in the original trie.

[0060] Node α in level i must be full, and can have no children. In this case it is clear that the invariants of Lemma 2 are maintained. It is now shown that all subtries rooted at level j - 1 maintain the invariants, assuming all subtries rooted at level j maintain them. Node α in level j - 1 will be either full or empty.

[0061] If α is full, there is no variation on the structure of the subtrie rooted at α that allows for a lower $w_i$ for any i. This is proven by contradiction. Assume there does exist a reorganization of the subtrie that would improve some $w_i$. This reorganization cannot involve moving the prefix associated with α to a lower level, since this would violate the first subtrie invariant; the prefix is already located at its lowest possible level. Therefore the reorganization must be accomplished leaving the prefix associated with α in place. It is impossible to move a prefix from one subtrie to another, since the root nodes of the subtries represent distinct prefixes that cannot be nested. This implies that either the left subtrie or the right subtrie can be improved given its current prefixes, which contradicts the assumption that the merging subtries are depth-optimal.

[0062] If α is empty, the algorithm selects a prefix, δ, from the lowest level of the deeper subtrie to take its position. Let β be the root of the left depth-optimal subtrie of α, and γ be the root of the right depth-optimal subtrie of α. Without loss of generality, assume that subtrie β is deeper than γ. By promoting δ, the resulting subtrie rooted at α will have $w_0^{\alpha} = w_0^{\beta} + w_0^{\gamma}$, and

$$w_i^{\alpha} = (w_{i-1}^{\beta} - 1) + w_{i-1}^{\gamma}$$

for all $i \geq 1$ such that $w_{i-1}^{\beta}$ is greater than 0, where $w^{\beta}$ and $w^{\gamma}$ are the level weights before the promotion. In effect, the promotion of δ has resulted in a left subtrie with level weight $w_i^{\beta}$ reduced by one for each non-zero weight. To prove the resulting subtrie is depth-optimal, it must be shown that there is no other node, ε, that can be promoted instead of δ that can result in a superior subtrie. Assume ε is in the subtrie rooted at β. For the promotion of ε to be superior, the resulting left subtrie must have a level weight less than $w_j^{\beta} - 1$ for some level j. Note, however, that insertion of a prefix can add at most 1 to any level weight. Thus, if one started with the left subtrie without ε and then inserted it, one would generate a new (complete) left subtrie with a level j weight less than $w_j^{\beta}$. This violates the depth-optimality assumption for the left subtrie. Thus, no node will improve the subtrie more than promoting a prefix at the lowest level. A similar argument handles the case when ε is assumed to come from the right subtrie.

[0063] A consequence of Lemma 2 is that the algorithm generates a bonsai trie that is depth-optimal in the sense described above. It can also be shown that the number of levels in the bonsai trie is minimized, and that the average depth of the nodes is minimized.

IVB.2 Search-Optimal Bonsai

[0064] In this section, a preferred embodiment of a dynamic programming method that computes the bonsai trie with the minimum number of expected steps per search is described. The approach assumes an arbitrary distribution of destination IP addresses. The structure starts as a simple trie that is augmented such that the probability of visiting each node on any given search is known. (For example, the root node must be visited on every search, so its probability is set to 1.) In practice, this probability distribution is likely to change over time. However, a distribution can easily be estimated for any desired period of time by tallying the nodes that are visited in each search of the simple trie.
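
The tallying step can be sketched as follows. This is one possible way to estimate the visit probabilities, under the assumption that addresses and prefixes are given as bit strings; simple-trie nodes are identified by their bit paths, and the names `visit_counts` and `child_probability` are hypothetical.

```python
from collections import Counter


def visit_counts(prefixes, trace):
    """Count how often each simple-trie node (named by its bit path) is visited."""
    # Every leading substring of a stored prefix is a node of the simple trie.
    paths = {p[:i] for p in prefixes for i in range(len(p) + 1)}
    counts = Counter()
    for addr in trace:
        path = ""
        while path in paths:
            counts[path] += 1
            if len(path) == len(addr):
                break
            path = addr[:len(path) + 1]      # follow the next address bit
    return counts


def child_probability(counts, path, bit):
    """Estimate P(child is visited | node at path is visited)."""
    return counts[path + bit] / counts[path] if counts[path] else 0.0


counts = visit_counts(["0", "10"], ["00", "01", "10", "11"])
print(counts[""], counts["0"], counts["1"], counts["10"])   # 4 2 2 1
print(child_probability(counts, "", "0"))                   # 0.5
```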

[0065] Dynamic programming is useful here because the problem exhibits both optimal substructure and overlapping subproblems. See T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. MIT Press, 1990. In this preferred embodiment, one begins at the lower levels of the simple trie and proceeds upwards by promoting prefixes appropriately. However, the number of prefixes that must be promoted out of any subtrie will not be immediately known. The approach therefore proceeds in two phases.

[0066] In the first phase, for each node α in the simple trie, an array $A_{\alpha}$ is calculated. $A_{\alpha}[i]$ will hold the optimal (least) expected number of search steps for this subtrie, assuming i prefixes are promoted out of the subtrie. How large do these arrays need to be? Since a node can only promote prefixes to its direct ancestors, node α will never have to consider the promotion of more than $d_{\alpha}$ nodes. Therefore, values of $A_{\alpha}[i]$ are calculated in the range $0 \leq i \leq d_{\alpha}$. It should be noted that there must be a special value for array elements that represents an infeasible number of promotions. For example, it is impossible to promote 4 prefixes from a subtrie that contains only 3 to begin with. During the calculation of these arrays, which is a bottom-up process, the number of prefixes that must be promoted from both the left and right subtries to generate the optimal subtrie must also be retained. Once the arrays are calculated, the second phase works from the top down to discover the optimal structure and create the corresponding bonsai. Starting with the root node, which does not need to promote any prefixes, each node will issue requests to its left and right children to promote some number of prefixes and generate an optimal subtrie based on that number.

[0067] Consider the first phase, which is the calculation of the A arrays. FIG. 8 describes the basic situation. Label the root node of the subtrie α, the left child β, and the right child γ. To calculate $A_{\alpha}$, knowledge of $A_{\beta}$ and $A_{\gamma}$ is needed, as well as the probability that each child will be visited (assuming the root node has been visited), which are labeled $p_{\beta}$ and $p_{\gamma}$. (Note that if both the left and the right child exist, $p_{\beta} + p_{\gamma}$ will equal 1.)

[0068] First consider the case when node α already contains a prefix. The subtrie structure corresponding to $A_{\alpha}[0]$ is straightforward; it is the case where no nodes are promoted out of either the left or right subtrie: $A_{\alpha}[0] = 1 + p_{\beta} A_{\beta}[0] + p_{\gamma} A_{\gamma}[0]$. For calculation of $A_{\alpha}[1]$ there are two possibilities to consider: promotion of 1 from the left and 0 from the right, or promotion of 0 from the left and 1 from the right. The best choice depends on which value is smaller, $p_{\beta} A_{\beta}[1] + p_{\gamma} A_{\gamma}[0]$ or $p_{\beta} A_{\beta}[0] + p_{\gamma} A_{\gamma}[1]$. This procedure continues until all required values of $A_{\alpha}$ are calculated. In general:

$$A_{\alpha}[i] = 1 + \min_{j+k=i} \{\, p_{\beta} A_{\beta}[j] + p_{\gamma} A_{\gamma}[k] \,\}$$

within the range $0 \leq i \leq d_{\alpha}$. Again, the j and k values that generated the minimal $A_{\alpha}[i]$ should also be retained.
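
[0069] In the case where α is empty, there is a need to promote a prefix into α as well as to promote prefixes above α. Even promoting zero prefixes from α will require promotion of one prefix from one of the two subtries; the best choice depends on which value is smaller, $p_{\beta} A_{\beta}[1] + p_{\gamma} A_{\gamma}[0]$ or $p_{\beta} A_{\beta}[0] + p_{\gamma} A_{\gamma}[1]$. In general:

$$A_{\alpha}[i] = 1 + \min_{j+k=i+1} \{\, p_{\beta} A_{\beta}[j] + p_{\gamma} A_{\gamma}[k] \,\}$$

within the range $0 \leq i \leq d_{\alpha}$.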
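
The first phase can be sketched as a bottom-up recursion. The code below is an illustrative sketch, not the embodiment itself. It makes several assumptions beyond the text: a missing child is treated as a subtrie with cost 0, a subtrie from which every prefix has been promoted costs 0, infinity serves as the special "infeasible" value, and the retained (j, k) choices needed by the second phase are omitted for brevity. The names `Node` and `cost_array` are hypothetical.

```python
import math

INF = math.inf   # special value marking an infeasible number of promotions


class Node:
    def __init__(self, full, zero=None, one=None, p_zero=0.0, p_one=0.0):
        self.full = full                 # True if the node holds a prefix
        self.child = (zero, one)
        self.p = (p_zero, p_one)         # P(child visited | this node visited)


def cost_array(node, depth=0):
    """Return (A, size): A[i] is the least expected number of search steps
    when i prefixes are promoted out of the subtrie, for 0 <= i <= depth."""
    if node is None:
        return [0.0] + [INF] * depth, 0
    a0, n0 = cost_array(node.child[0], depth + 1)
    a1, n1 = cost_array(node.child[1], depth + 1)
    size = n0 + n1 + (1 if node.full else 0)
    need = 0 if node.full else 1         # an empty root must pull one prefix up
    A = []
    for i in range(depth + 1):
        if i > size:
            A.append(INF)                # more promotions than prefixes held
        elif i == size:
            A.append(0.0)                # the whole subtrie is promoted away
        else:
            total = i + need             # prefixes leaving the two child subtries
            best = INF
            for j in range(total + 1):
                k = total - j
                if a0[j] < INF and a1[k] < INF:
                    best = min(best, node.p[0] * a0[j] + node.p[1] * a1[k])
            A.append(1.0 + best)
    return A, size


# Simple trie for prefixes "0" and "1": empty root, two full leaves,
# each child visited with probability 0.5.
root = Node(False, zero=Node(True), one=Node(True), p_zero=0.5, p_one=0.5)
A_root, _ = cost_array(root)
print(A_root)   # [1.5]
```

In the usage example the empty root must take one of the two prefixes, leaving a single child that is visited half the time, which gives 1 + 0.5 = 1.5 expected steps.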

IV.C Experimental Results and Analysis 

[0070] In the experiments, Internet forwarding tables made available at the Internet Performance Measurement and Analysis (IPMA) web site are used. See Internet Performance Measurement and Analysis Project (IPMA). Available at http://nic.merit.edu/~ipma/. These forwarding tables, which are updated daily, have become standards for IP forwarding experiments. The data used here is from 17 August 1998. In order to simulate a realistic distribution of IP datagram destinations, a trace of real datagram destination IP addresses from fix-west is used. The trace contains 2,146,573 addresses (five minutes' worth), recorded on 22 February 1997. This trace is made available by the National Laboratory for Applied Network Research (NLANR). See National Laboratory for Applied Network Research (NLANR). Available at http://www.nlanr.net/NA/. It is to be noted that this trace was not gathered from the routers whose forwarding tables are available at the IPMA page.

[0071] Four metrics for the bonsai are considered:

• the depth,

• the average level of a node/prefix,

• the expected number of steps (or comparisons) per search, assuming a uniform distribution of destination IP addresses, and

• the expected number of steps per search, assuming the distribution of destination IP addresses defined by the fix-west trace.

[0072] Table 2 provides some description of the routing tables for the five locations. For each site, the number of prefixes in the table (equal to the number of nodes in the bonsai) is listed. Also listed are the hit rate and miss rate for the fix-west trace relative to the given forwarding table. As discussed earlier, the trace is not associated with the forwarding tables, creating the possibility of a significant fraction of misses.

Site        Prefixes    Trace Statistics
                        Hit Rate    Miss Rate
aads        24325       37.63%      62.37%
mae-east    41123       95.68%      4.32%
mae-west    19260       69.43%      30.57%
paix        4241        6.33%       93.67%
pb          22830       30.83%      69.17%

Table 2: Metrics for routing tables at five locations, 17 August 1998. Hit rate and miss rate for five minutes of destination IP addresses from fix-west around noon, 22 February 1997. The trace contains 2,146,573 IP datagrams.



[0073] Table 3 contains information for 100 bonsai where the prefixes are inserted in random order. For each metric, the minimum, average, and maximum values are shown. Fairly consistent behavior for all the routers is noted, with the numbers for paix typically being somewhat smaller, due to the fact that it holds fewer prefixes. The bonsai have a typical depth of around 24, while the average level of a node is approximately 18 for the large tables. Also note that the average number of comparisons per search is much smaller for the uniform distribution than for the fix-west trace. This is because the uniform distribution assumes a large percentage of destination addresses in very sparse areas of the bonsai, where there are few possible matches. This phenomenon is discussed in more detail later in this section.

[0074] Metrics for the depth-optimal bonsai are listed in Table 4. In all cases the depth-optimal bonsai is shallower than the best of the 100 random bonsai. The depth-optimal bonsai will also have the smallest average node level. Interestingly, the depth-optimal bonsai is worse than the average random bonsai in terms of comparisons per search. Intuitively, this occurs because the deep nodes of the trie, while they can add to the depth, are less likely to be visited during a search than nodes higher up in the trie.

[0075] Table 5 contains data for the search-optimal bonsai. For this optimization, the depth and the average node level are worse than for the random bonsai, but there are real gains in terms of comparisons per search. Again, this indicates the tradeoff between depth and search time.

[0076] To facilitate direct comparisons between the different bonsai approaches, bar graphs for depth (FIG. 9), average node level (FIG. 10), and average comparisons per search for the fix-west trace (FIG. 11) are included. It is to be noted that the depth-optimal bonsai has modest but consistent advantages compared to other approaches for both depth and average node level: improvements in depth relative to random insertion range from 4% to 11%, while improvements in average node level range from 1% to 2%. It should also be noted that the search-optimal bonsai is worse than the average random bonsai for these metrics.

Table 3: Statistics for 100 bonsai for each location. Prefixes were inserted in random order.

Table 4: Statistics for depth-optimal bonsai for each location.

Table 5: Statistics for search-optimal bonsai (relative to the ...) for each location.

[0077] The data on the average number of comparisons per search, however, shows the advantages of the search-optimal bonsai. Improvements relative to the average random bonsai range from 9% to 13%. It should also be noted that the depth-optimal bonsai performs relatively poorly for this metric, though it does slightly outperform the random bonsai for the mae-west forwarding table.

[0078] In general the results suggest that random insertion of prefixes will provide reasonable performance for all the metrics considered.

[0079] There are several caveats related to the interpretation of these experimental results. For one thing, the trace is not taken from the routers that are examined but from a completely separate location. The extent to which the trace provides a realistic distribution is therefore debatable. However, when the effect on performance is considered, misses are not necessarily problematic, since even misses require a full bonsai search. In other words, misses are not necessarily performance outliers.

[0080] Another issue is that there are certain peculiarities related to the assignment of IP addresses that can affect performance. For example:

• The forwarding tables have no entries for either Class D IP addresses (which begin with "1110" and are used for multicast) or Class E IP addresses (which begin with "11110" and are reserved for future use).

• The fix-west trace has no Class E addresses, but 25,719 destination addresses are in Class D (approximately 1.2%).

• The uniform distribution assumes Class D and Class E addresses are possible, with frequencies of approximately 6.3% and 3.1%, respectively.

Since the bonsai will contain no prefixes for Class D or Class E addresses, it will not have a node at or below locations "1110" and "11110". In fact, only Class D or E addresses can start with "111". Thus, Class D and E searches are faster than average: at most 3 levels of the bonsai will be searched. Similar effects can be seen in other areas of the address space: none of the forwarding tables contains a prefix that begins with "111" or "010", and each table contains only one prefix that begins with "011". (The entry that begins with "011" is in all cases 127/255, which is reserved for loopback testing.)
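
The percentages quoted in the list above follow from simple arithmetic, which the short sketch below reproduces. The helper name `uniform_fraction` is hypothetical; the counts are taken from the text.

```python
def uniform_fraction(bit_prefix):
    """Fraction of uniformly random addresses that start with bit_prefix."""
    return 2.0 ** -len(bit_prefix)


class_d_uniform = uniform_fraction("1110")     # 1/16 = 6.25%, quoted as about 6.3%
class_e_uniform = uniform_fraction("11110")    # 1/32 = 3.125%, quoted as about 3.1%
class_d_trace = 25719 / 2146573                # share of Class D in the fix-west trace

print(f"{class_d_uniform:.2%} {class_e_uniform:.3%} {class_d_trace:.1%}")
```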

[0081] In light of the importance of prefix and IP destination address distributions, it is worthwhile examining real forwarding tables and real traces. FIG. 12 shows the distribution of prefixes from the mae-east forwarding table, based on the dotted decimal notation of the first byte of the prefix. Both linear and logarithmic scales are shown. The data shows that Class C prefixes are far more common than others. There are very few Class A addresses, while the Class B addresses are fairly evenly distributed within a certain range. Again, these distributions have important consequences for trie-based approaches. The left side of a bonsai, for example, will be very sparsely populated in comparison to the right side. And similar arguments hold further down the trie.

[0082] FIG. 13 shows the distribution for the first byte of the destination IP addresses for the fix-west trace. The data shows again that a uniform distribution is not a good model of the traffic. Traffic for Class B and Class C is far larger than for Class A. In fact, two bytes (128 and 192 in decimal notation) account for more than one third of all destination addresses.

IVD. Pipeline Implementation 

[0083] Though the discussion so far has implicitly focused on a software implementation, the preferred embodiment bonsai also lends itself to dedicated hardware implementation. Throughput can be as high as one search per memory-access time. Inserts and deletes can be accomplished with no more than two clock cycle stalls in the pipeline. In this section, a preferred embodiment of a pipeline implementation of bonsai is presented. It should be noted that the pipelining method is not restricted to bonsai. Many LMP search approaches can be pipelined in a similar manner. Gupta et al. provide one example, but for many of these approaches inserts and deletes are problematic. See P. Gupta, S. Lin, and N. McKeown, "Routing lookups in hardware at memory access speeds," in Proceedings IEEE INFOCOM'98, pp. 1240-1247, 1998.

[0084] Consider FIG. 14, which shows an abstract pipeline for a bonsai. A pipeline stage consists of a memory component, some simple logic, and a bank of latches. The simplest implementation would handle one level of the bonsai at each stage, requiring d stages for a depth-d bonsai. (To reduce the number of stages it can be helpful to construct the depth-optimal bonsai.) At the input to stage 0, the destination IP address is sent for a search, or the prefix for an insert or delete. A small instruction code that allows the pipeline to distinguish between searches, inserts, and deletes is also sent.

[0085] For the purposes of this discussion, assume the pipeline has one stage for each level of the bonsai. The data in level i of the bonsai will be stored in stage i of the pipeline. The latches before each stage will store the following information:

• A prefix, P, used as an input for inserts and deletes and an output for searches.

• A destination IP address, D, used as an input for searches.

• A pointer, R, which points to the appropriate node in the current level.

• An instruction, I, which indicates whether this stage of the pipeline is doing a search, an insert, or a delete.

• A state, S, which contains information about the state of the current instruction. For example, during a search one may need to know if any previous matches have been found, or during a delete one may be promoting prefixes at this point rather than deleting.

The basic hardware for each stage of the pipeline will include the following:

• Latches that hold relevant input and output information, as described above.

• Memory that contains information on the nodes of the level. Nodes are of fixed size, and contain a prefix and two pointers. When writing to a node, it is possible to write just to a node's prefix or to one of its two child pointers.

• A stack (or some other structure) that contains pointers to unused node addresses in the subsequent stage. For example, during an insert, some node will need to allocate memory for a child. Similarly, during a delete, some node will have to deallocate memory for a child. Memory management is quite simple due to the fixed size of the nodes.

• Comparators, for example to check prefix matches or compare prefix lengths.

• A variety of more basic building blocks, such as multiplexers and logic gates.

[0086] It is to be noted that although subsequent instructions in the pipeline are independent, it will be necessary to stall the pipeline in some cases. These can occur only during inserts or deletes, however, and will never cause more than 2 clock cycle stalls.

[0087] Though an exhaustive description of the pipeline design is inappropriate here, the hardware required for some specific cases is examined, to provide a feel for the design requirements. FIG. 15 shows a detailed example of potential operation during a search. At this point, assume that the relevant node at stage i exists, and that another relevant node exists at stage i + 1. It is checked to see if the destination IP address matches the prefix in stage i. If the prefix does match, and if it is longer than any earlier matches, it will be passed along to the next stage, along with the appropriate pointer for the next node.
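
The data flow of a search through the stages can be modeled in software as below. This is a behavioral sketch only: it passes one search through the stages in sequence and does not model clock cycles, overlapping instructions, or stalls. The latch fields D, P, and R follow the list above; the instruction and state fields are left out, and the names `StageNode`, `Latch`, `search_stage`, and `pipeline_search` are hypothetical. The stage memories hold the fragment of FIG. 2 named in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StageNode:
    prefix: str
    child: List[Optional[int]]   # indices into the next stage's memory


@dataclass
class Latch:
    D: str                       # destination IP address (bit string)
    P: Optional[str]             # longest matching prefix found so far
    R: Optional[int]             # pointer into this stage's memory, None if no node


def search_stage(memory, latch, level):
    """Process one search at one stage and produce the latch for the next stage."""
    if latch.R is None:
        return latch             # nothing left to examine; pass the result along
    node = memory[latch.R]       # the single memory read of this stage
    best = latch.P
    if latch.D.startswith(node.prefix) and (best is None or len(node.prefix) > len(best)):
        best = node.prefix
    nxt = node.child[int(latch.D[level])] if level < len(latch.D) else None
    return Latch(latch.D, best, nxt)


def pipeline_search(stages, addr):
    latch = Latch(addr, None, 0 if stages and stages[0] else None)
    for level, memory in enumerate(stages):
        latch = search_stage(memory, latch, level)
    return latch.P


stages = [
    [StageNode("10011", [0, None])],     # stage 0: root
    [StageNode("01100", [None, 0])],     # stage 1
    [StageNode("0100", [0, None])],      # stage 2
    [StageNode("010", [None, None])],    # stage 3
]
print(pipeline_search(stages, "01000000"))   # 0100
```

Because each stage reads only its own memory and the latch handed to it, successive searches could occupy successive stages at the same time, which is what allows one result per memory access in hardware.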

[0088] The simple case for an insert occurs when an inserted prefix falls into the first empty node. This can be accomplished with a single stall in the pipeline, which is required to write a new pointer into the parent of the new leaf node. FIG. 16 shows this case. First, the memory is read, and it is found that there is no appropriate child node. A pointer is allocated at the pointer stack for the new node. This pointer is written back into the memory (during the second clock cycle), and it is also passed on to the subsequent stage, along with the prefix to be inserted. Another case occurs when an insert causes another prefix to be dislodged. If a prefix is dislodged in stage i, this will require two memory accesses at stage i: a read of the prefix to be dislodged, followed by a write of the prefix to be inserted. This can be accomplished with one stall of the pipeline. Note that even if more prefixes are dislodged for this insert further down the pipeline, no more stalls will be necessary.

[0089] Deletes are more difficult, since a deleted prefix in level i will require a pointer update in level i - 1, which will require bypass hardware between stages. Level i can be read during clock cycle j. Level i - 1 can be written during clock cycle j + 1 (note that the pointer to the parent node must have been saved in the latches). In general, the children of the deleted prefix also need to be promoted. Stage i + 1 can be read in clock cycle j + 1, and written back to stage i during clock cycle j + 2. The beginning of this action is shown in FIG. 17. It is found that there is no match in stage i. It should be noted that a child exists, so a prefix needs to be promoted. The bypass hardware needed for the write-back is not shown for clarity. In the worst case, deletes will force two clock cycles of stall.

[0090] One difficulty with this design is the imbalance in the number of nodes per bonsai level, which corresponds
to an imbalance in the memory sizes needed by the pipeline stages. FIG. 1 shows, for example, that some levels have
no nodes while others have thousands. It should be fairly easy to split one level of a bonsai over several contiguous
pipeline stages when the memory of a single stage is insufficient, but it is still the case that the levels
near the root will only require a very limited amount of memory.
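
To see how uneven the per-stage memory requirement is for a given bonsai, one can count the nodes on each level. The following is a minimal sketch; the `Node` class and the function name `nodes_per_level` are assumptions made for illustration and do not appear in the disclosure.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    prefix: str                      # prefix stored in this node, as a bit string
    left: Optional["Node"] = None    # child followed when the next bit is 0
    right: Optional["Node"] = None   # child followed when the next bit is 1


def nodes_per_level(root: Optional[Node]) -> Counter:
    """Return a mapping level -> number of nodes, i.e. entries needed by that pipeline stage."""
    counts: Counter = Counter()
    stack = [(root, 0)] if root is not None else []
    while stack:
        node, level = stack.pop()
        counts[level] += 1
        for child in (node.left, node.right):
            if child is not None:
                stack.append((child, level + 1))
    return counts
```

Level 0 always holds exactly one node (the root), which illustrates why the stages near the root need very little memory.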

IV.E. A Network System Using Bonsai

[0091] FIG. 18 shows an implementation of a preferred embodiment of a network system according to the present
invention. This network system comprises a plurality of hosts 18.10-18.13. Each host has routers 18.20-18.23
associated with it. Prefixes of addresses are stored in the routers using bonsai tries. Bonsai tries, as noted above, are
implementations of binary tries wherein prefixes related to addresses are stored such that each node in the bonsai trie has a
prefix stored and no node is empty.

[0092] Other modifications and variations to the invention will be apparent to those skilled in the art from the
foregoing disclosure and teachings. Thus, while only certain embodiments of the invention have been specifically described
herein, it will be apparent that numerous modifications may be made thereto without departing from the spirit and scope
of the invention.

Claims

1. A method of storing a set of prefixes related to a set of addresses, said method comprising storing the prefixes in
a binary trie fashion wherein each node in said binary trie is associated with at least one of said prefixes and no 
node in said binary trie is empty. 

2. The method of claim 1 wherein a first prefix is inserted into an empty trie by allocating a root node and placing said
prefix in the root node.

3. The method of claim 1 or 2, wherein a prefix other than a first prefix, comprising k bits, with a representation
b_0, b_1, ..., b_{k-1}, wherein k is an integer greater than 0, is inserted using a process comprising:

a) designating a root node of said trie as a current node and b_n = b_0 and the prefix as the current prefix;

b) terminating insertion if the current node has the current prefix already stored in it;

c) examining the current node's left child if b_n = 0 and examining the current node's right child if b_n = 1;

d) allocating a new node and placing the current prefix if one of left child and right child do not exist, and
designating said new node as the current node;

e) assigning n = n+1;

f) repeating steps b-e until n = k;

g) replacing a previously stored prefix in the current node with the current prefix and designating the previously
stored prefix as the current prefix and repeating steps b-g.
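
For illustration only, the insertion steps of claims 2 and 3 can be sketched in software as below. The `Node` class, the function name `insert`, and the encoding of prefixes as strings of '0' and '1' are assumptions of this sketch, and it relies on the property that a node at depth n only ever holds a prefix of at least n bits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    prefix: str                      # prefix stored in this node, as a bit string
    left: Optional["Node"] = None    # child examined when b_n = 0
    right: Optional["Node"] = None   # child examined when b_n = 1


def insert(root: Optional[Node], prefix: str) -> Node:
    """Insert `prefix` and return the (possibly new) root."""
    if root is None:
        return Node(prefix)                  # claim 2: first prefix goes into the root
    node, current, n = root, prefix, 0       # step a
    while True:
        if node.prefix == current:           # step b: already stored, stop
            return root
        if n == len(current):
            # Step g: the current node sits at depth k, so the stored prefix is longer.
            # Swap them and keep inserting the dislodged prefix from here.
            node.prefix, current = current, node.prefix
        bit = current[n]                     # step c: pick the child by bit b_n
        child = node.left if bit == "0" else node.right
        if child is None:                    # step d: allocate a new node for the prefix
            if bit == "0":
                node.left = Node(current)
            else:
                node.right = Node(current)
            return root
        node, n = child, n + 1               # steps e and f: descend one level
```

Inserting "1", "00", "000" and then "0" in that order shows the dislodging behavior: "0" displaces "00" at depth 1, which in turn displaces "000" at depth 2.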

4. The method of claim 1 or 2, wherein said trie is searched for an LMP of an address comprising k bits, with a
representation b_0, b_1, ..., b_{k-1}, wherein k is an integer greater than 0, using a process comprising:

a) designating a root node as current node as well as an LMP node if the root node has a matching prefix and
b_n = b_0;

b) designating current node as an LMP node and carrying said LMP node lower if the current node has a
matching prefix and if the matching prefix is longer than the LMP node;

c) designating the current node's left child as the current node if b_n = 0 and designating the current node's right
child as the current node if b_n = 1;

d) n = n+1;

e) repeating steps b-d until the current node is at a lowest level of the trie; and

f) selecting a prefix corresponding to said lowest trie as an LMP if said prefix is a match.
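
A software sketch of the search in claim 4 is given below for illustration. The `Node` class, the function name `longest_matching_prefix`, and the bit-string encoding are assumptions of this sketch; the walk simply follows the address bits and remembers the longest stored prefix that matches.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    prefix: str                      # prefix stored in this node, as a bit string
    left: Optional["Node"] = None    # child followed when b_n = 0
    right: Optional["Node"] = None   # child followed when b_n = 1


def longest_matching_prefix(root: Optional[Node], address: str) -> Optional[str]:
    """Return the longest stored prefix that matches `address`, or None if none matches."""
    best: Optional[str] = None
    node, n = root, 0
    while node is not None:
        # Steps a and b: remember this node's prefix if it matches and is the longest so far.
        if address.startswith(node.prefix) and (best is None or len(node.prefix) > len(best)):
            best = node.prefix
        if n == len(address):
            break                            # no more address bits to follow
        # Steps c and d: move to the child selected by bit b_n, then advance n.
        node = node.left if address[n] == "0" else node.right
        n += 1
    return best
```

Because a node at depth n holds a prefix whose first n bits equal the path to that node, every prefix that matches the address lies on the single path the loop follows.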

5. The method of claim 1, 2, 3, or 4, wherein a prefix corresponding to an address is deleted in said trie using a
process comprising:

a) searching for a matching node corresponding to said prefix;

b) deleting the matching node if the matching node is a leaf node and terminating the process;

c) deleting said matching node and moving up one of said matching node's children if said matching node is
not a leaf node and deleting said one of said matching node's children; and

d) repeating steps b-c until a leaf node is deleted.
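
The deletion of claim 5 can be sketched in software as follows, for illustration only. The `Node` class and the function name `delete` are assumptions of this sketch. The claim leaves open which child is moved up; this sketch arbitrarily prefers the left child, and it implements "moving up" by copying the child's prefix into the parent node.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    prefix: str                      # prefix stored in this node, as a bit string
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def delete(root: Optional[Node], prefix: str) -> Optional[Node]:
    """Delete `prefix` if present and return the (possibly empty) root."""
    # Step a: search along the bits of the prefix for the node that stores it.
    parent, node, n = None, root, 0
    while node is not None and node.prefix != prefix:
        if n == len(prefix):
            return root                      # deeper nodes hold longer prefixes: not stored
        parent = node
        node = node.left if prefix[n] == "0" else node.right
        n += 1
    if node is None:
        return root                          # prefix is not stored
    # Steps c and d: while the node has a child, move that child's prefix up
    # and continue with the child (left child preferred, an arbitrary choice).
    while node.left is not None or node.right is not None:
        child = node.left if node.left is not None else node.right
        node.prefix = child.prefix
        parent, node = node, child
    # Step b: the node is now a leaf, so unlink it.
    if parent is None:
        return None                          # the trie held only this prefix
    if parent.left is node:
        parent.left = None
    else:
        parent.right = None
    return root
```

Moving a child's prefix up one level is always valid, since a prefix that fits at depth n + 1 also begins with the n bits leading to its parent.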

6. The method of any one of claims 1 to 5 wherein said trie is balanced for minimizing a depth in a worst-case search.

7. A method of converting a simple trie with stored addresses into a depth-optimal sub-trie that has all nodes
representing addresses, said method comprising:

a) finding a lowest level of said simple trie that has a full node and designating said lowest level as i, wherein i
is an integer;

b) examining each node at a level corresponding to i-1;

c) moving up a prefix if there is an empty node at level i-1 from a bottom of the deeper subtrie of the empty
node; and

d) continuing said merging until the root node is reached.

8. A method of converting a simple trie, with stored addresses and known probabilities of visiting each node in said
simple trie, into a search-optimal trie with a minimum number of expected steps per search, said method using
dynamic programming and said method comprising:

a) calculating an array A_α for each node α using a bottom-up process such that A_α[i] holds a least expected
number of search steps assuming i nodes are promoted out of a sub-trie with α as a root, wherein,

A_α[i] = f(A_β, A_γ, P_β, P_γ)

β and γ are the left and right children of α;

P_β and P_γ represent the probability that β and γ are visited during a search assuming that α has been
visited;

b) associating with each A_α[i] a number of prefixes that must be promoted from β and γ to generate optimal
subtries associated with each A_α[i]; and

c) working recursively top-down from the root to issue requests to child nodes to promote prefixes up, the root
node requesting 1 prefix if the root node does not hold a prefix, the root node requesting 0 prefix if the root
node holds a prefix, said requests being based on the array A and the associated numbers of step b.

9. The method of any one of claims 1 to 8 wherein said addresses are Internet addresses and said trie is located in an
IP router.

10. A networking system comprising a plurality of routers, each router having an address storage, wherein in each
address storage a set of prefixes related to network addresses corresponding to the network system are stored in
a form of a binary trie, said binary trie comprising a plurality of nodes wherein each node is associated with a prefix
of at least one of said network addresses and no node in said binary trie is empty.

11. The system of claim 10 wherein said binary trie is balanced for minimizing a depth in a worst-case search. 

12. A computer program product including a computer-readable medium, said program enabling one or more of
computers associated with a networking system to store a set of addresses stored in each router within said networking
system in a binary trie fashion wherein each node in said binary trie is associated with a prefix of at least one of
said addresses and no node in said binary trie is empty.

13. The computer-program product of claim 12 wherein said trie is balanced for minimizing a depth in a worst-case 
search. 
14. A system for storing a set of addresses in a binary trie fashion wherein each node in said binary trie is associated
with a prefix of at least one of said addresses, wherein said system comprises a pipeline, said pipeline further
comprising a plurality of stages, each stage from said plurality of stages corresponding to a level in said binary trie, said
stage consisting essentially of a memory component, a bank of latches and a simple logic, said bank of latches
storing a prefix, a destination IP address, a pointer pointing to an appropriate node, an instruction that indicates the
task of the corresponding stage, and a state containing information about a state of the instruction.

15. The system of claim 14 wherein each stage of said pipeline comprises latches holding input and output information,
memory containing information corresponding to nodes at a level, a stack containing pointers to unused node
addresses and comparators.